Abstract: It is generally accepted that using the most informative areas of the input image significantly optimizes
visual processing. Several authors agree that areas of spatial heterogeneity are the most interesting for the visual system
and that the degree of difference between those areas and their surroundings determines saliency. The purpose of our study
was to test the hypothesis that the most informative areas of the image are those with the largest increase in total luminance
contrast, and that information from these areas is used in the categorization of facial expressions. Using our own program,
developed to imitate the work of second-order visual mechanisms, we created stimuli from initial photographic images of faces
with 6 basic emotions and a neutral expression. These images consisted only of areas of highest increase in total luminance
contrast. Initially, we determined the spatial frequency ranges in which the selected areas contain the most useful information
for the recognition of each expression. We then compared expression recognition accuracy in images of real
faces and in those synthesized from the areas of highest contrast increase. The results indicate that the recognition of
expressions in synthesized images is somewhat worse than in real ones (73% versus 83%). At the same time, the partial loss
of information that occurs when real images are replaced with synthesized ones does not disrupt the overall logic of recognition.
Possible ways to make up for the missing information in the synthesized images are suggested.

Keywords: expression recognition, saliency, total luminance contrast, second-order visual filters. 


Introduction 


It is obvious that different image areas contain different amounts of information. The classical experiments
of A. Yarbus (Yarbus, 2013) showed that the eyes ignore homogeneous areas of the
image and that the gaze is instead directed to the most heterogeneous areas.

Starting from the early levels of visual processing, neurons respond precisely to heterogeneities. 
Thus, striate neurons are activated by luminance heterogeneity in their receptive fields (Marat et al., 2013).
However, single luminance gradients are only local heterogeneities. When it comes to the perception of 
scenes or objects, salient regions have significant spatial extent. In this case, the heterogeneity is spatial 
modulation of luminance gradients (changes in their contrast, orientation, or spatial frequency). 

The optimization of visual perception implies finding and processing the most informative parts
of the input image. A number of authors have posited that the areas that differ most from their surroundings
are of the greatest interest to the visual system and attract the attention of the observer (Bruce and
Tsotsos, 2009; Marat et al., 2013; Perazzi et al., 2012; Xia et al., 2015). Perhaps mental representations
of complex visual stimuli are formed from the information in these areas. The importance of finding the
areas of interest has motivated a large number of studies aimed at developing algorithms for identifying them
and constructing saliency maps. However, a significant part of the proposed saliency detection algorithms
is neither based on nor takes into account the real brain mechanisms of visual perception (Cheng et al., 2015;
Perazzi et al., 2012; Wu, Shi and Lu, 2012).

The human visual system has tools for detecting spatial modulations of luminance gradients in the
input image. These are the so-called second-order visual filters (Graham, 2011), which act preattentively.
They combine, over a certain spatial interval, the outputs of striate neurons (first-order filters) with the same
frequency tuning. First-order filters encode information about the carrier (localization, spatial frequency
and orientation of luminance gradients). The second-order filters are activated when a spatial modulation of
the contrast, orientation or spatial frequency of these gradients (the envelope) falls within their receptive fields.
Moreover, the higher the modulation amplitude, the stronger their response. At the same time, it has been
shown that different second-order filters respond to different modulations (Yavna, 2012). Since orientation
modulations are primarily important for detecting texture boundaries (Solomon and Morgan, 2017), and
spatial frequency modulations are important for detecting surface curvatures (Sakai and Finkel, 1995),
it is fair to consider the filters selective to contrast modulations to be the first candidate for the role of a
segmentation mechanism for real scenes and objects (Acik et al., 2009; Frey, Konig and Einhauser, 2007;
't Hart et al., 2013).

The aim of our study was to determine the role of image areas of largest increase in total (non-local) 
luminance contrast in visual processing using facial expression recognition tasks. The hypothesis was 
that information from these areas of the image is used in categorization. 

We chose faces as visual stimuli due to both their high social significance and multidimensionality, 
which implies separate processing of variable and invariant facial characteristics. At the same time, face
detection and identification are remarkably fast (Cauchoix et al., 2014; Willis and Todorov,
2006). The same applies to emotion recognition (Willis and Todorov, 2006; Liu and Ioannides, 2010;
Vuilleumier and Pourtois, 2007).

To test our hypothesis, we created the gradient operator of total contrast (GOTC), a computer program
that simulates the second-order filters and calculates a map of instantaneous values of the non-local 
contrast modulation function over the entire image (Babenko et al., 2021). These maps make it possible 
to create stimuli using areas of the raster image with certain modulation values. 

To a certain extent, this approach resembles the Bubbles method (Gosselin and Schyns, 2001; 
Smith et al., 2005). In both approaches the accuracy of expression recognition is studied when fragments 
of the face image are shown to the subjects. The difference is that in the Bubbles method, the fragments 
are selected randomly, and in our study, they are selected in accordance with the contrast gain. In 
addition, the Bubbles technique involves the preliminary learning of the initial set of faces, so observers 
are working with familiar faces, and this changes the range of effective spatial frequencies (Butler et 
al., 2010; Lobmaier and Mast, 2007; Smith, Volna and Ewing, 2016). Our approach allows us to use 
unfamiliar faces, which does not limit the number of stimuli and brings the experimental procedure closer 
to the real conditions of face perception. In addition, the Bubbles technique cannot be used to answer the
question about the mechanisms that highlight certain facial features.

Prior to creating stimuli, it was necessary to determine several parameters of the model that 
simulates how second-order filters work. First of all, we had to choose the spatial frequency ranges in 
which the contrast modulation should be calculated. Since second-order filters were previously found to 
form five spatial frequency pathways that are tuned in 1 octave steps (Ellemberg et al., 2006), we decided 
to follow this scheme. 

Secondly, it was necessary to select the parameters of the apertures through which the whole 
image and its fragments are passed during the formation of facial stimuli. To keep a constant ratio
between the carrier and envelope frequencies, the aperture diameter was halved for each 1-octave
increase in the filtering frequency in cycles per image (CPI), while the filtering frequency inside
apertures of different diameters remained constant and equal to 4 cycles per aperture diameter.
This filtering frequency was chosen based on data on the optimal ratio of the carrier and envelope frequencies
in human perception of contrast modulations (Babenko, Ermakov and Bozhinskaya, 2010; Sun and
Schofield, 2011). Similar psychophysical results were also obtained in the analysis of neuronal responses 
in V2 in primates (Willis and Todorov, 2006). Another aperture parameter is the transfer function. Based 
on the central subfield profile of the second-order filter the transfer function was set as Gaussian. 

Thirdly, the number of apertures at each filtering frequency had to be determined. The entire face 
image is described by a single aperture with the lowest filtering frequency (in CPI). We decided that since 
at each next step the filtering frequency should double, the number of selected areas should also double. 
In this case, the total diameter of apertures remains constant, and the filtering frequency in cycles per 
image increases by a factor of 2 at each frequency step. 
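
The aperture scheme described above can be illustrated with a short calculation. The sketch below is only an illustration: the function name is hypothetical, and the default image diameter of 880 pixels is taken from the Stimuli section rather than from the GOTC program itself.

```python
CYCLES_PER_APERTURE = 4                 # filtering frequency inside every aperture
FREQUENCIES_CPI = [4, 8, 16, 32, 64]    # five pathways in 1-octave steps


def aperture_scheme(image_diameter_px=880.0):
    """Return (frequency in CPI, aperture diameter in px, number of apertures) per step."""
    scheme = []
    for freq in FREQUENCIES_CPI:
        # Halving the diameter doubles the frequency in cycles per image
        # while keeping 4 cycles inside the aperture.
        diameter = image_diameter_px * CYCLES_PER_APERTURE / freq
        # One aperture at the lowest frequency, doubling at every step.
        count = freq // FREQUENCIES_CPI[0]
        scheme.append((freq, diameter, count))
    return scheme


for freq, diameter, count in aperture_scheme():
    print(f"{freq:>2} CPI: {count:>2} aperture(s) of {diameter:.0f} px, "
          f"summed diameter {count * diameter:.0f} px")
```

Under these assumptions the summed diameter of the apertures is the same at every frequency step, which is the property stated in the text.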

Materials and Methods


Participants 

A total of 179 subjects of both sexes, all Europeans aged 18 to 30 years, took part in the experiments.
All participants had normal or corrected vision and had no history of neurological or psychiatric disease. 
The subjects were informed about the upcoming procedure and gave written consent to voluntarily 
participate in the experiment. The study was approved by the local ethics committee and was performed 
in accordance with The Code of Ethics of the World Medical Association (Declaration of Helsinki). 


Equipment 

The experimental setup included an x86-64 compatible PC running Ubuntu Linux with an NVIDIA GeForce
GT 730 graphics card and an Acer VG271U Pbmiipx monitor. The screen resolution was 2560x1440 and the frame rate was
60 Hz. The monitor was calibrated with a digital luminance meter in grey scale mode. The ACM (Adaptive
Contrast Management) and HDR (High Dynamic Range) functions were disabled. The luminance
varied from 1 to 225 cd/m², and the gamma non-linearity was standard, with an exponent of 2.2.


Stimuli 

The set of stimulus images of faces with different emotional expressions was compiled from open
access databases: MMI (Pantic et al., 2005), KDEF (Lundqvist, Flykt and Ohman, 1998), Rafd (Langner 
et al., 2010) and WSEFEP (Olszanowski et al., 2015). For further processing and preparation of the 
stimulus material we selected 70 initial full-faced photographs of male and female Caucasian faces with 
the expression of 6 basic emotions according to P. Ekman (Ekman, 1992) (fear, anger, sadness, disgust, 
surprise, happiness), and a neutral expression. Each emotion was represented by 10 faces (5 male and 
5 female). Different faces were used for different expressions. 

First, faces from different databases were equalized in average luminance (50 cd/m²) and RMS
contrast, and size-adjusted to a circle of 880 pixels. Then, each initial image was processed using GOTC,
which simulates the functioning of a set of second-order filters with the same localization and filtering
frequency across the full range of orientation tunings. The operator is a concentric area with a Difference of Gaussians
profile. The diameter of the center of this area (the "window") is equal to the width of the surrounding ring.
The filtering frequency in the window was constant and equal to 4 cycles per window. When the size of
the operator was halved, the filtering frequency in cycles per image (CPI) doubled. Thus, for an
image filtered at a frequency of 4 CPI, the window diameter is equal to the size of the entire image. For
an image filtered at a frequency of 8 CPI, the window was half the image size; for a filtering frequency of
16 CPI it was reduced by a factor of 4, for 32 CPI by a factor of 8, and for 64 CPI by a factor of 16.
The bandwidth of all filters was the same and equal to 1 octave.

The operator window calculates spectral power of the image filtered at a given frequency in CPI. 
The spectral power of all spatial frequencies perceived by a human was calculated in the surrounding 
ring and rescaled to average power per 1 octave. The non-local contrast increase at each position was
calculated as the difference between the total energy in the center of GOTC and on its periphery. The
operator scans the entire image and builds a two-dimensional map of the contrast gain.
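
The center-minus-periphery computation can be sketched as follows. This is a simplified illustration, not the GOTC implementation: it assumes that the two energy maps (band-limited energy for the window, per-octave broadband energy for the ring) are already available, it replaces the Difference of Gaussians profile with a hard-edged disc and ring, and it uses circular (wrap-around) convolution. All function names are hypothetical.

```python
import numpy as np


def ring_mask(shape, radius_inner, radius_outer):
    """Normalised disc/ring mask centred on pixel (0, 0) with wrap-around offsets."""
    rows = np.fft.fftfreq(shape[0]) * shape[0]   # signed pixel offsets 0, 1, ..., -1
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    dist = np.hypot(rows[:, None], cols[None, :])
    mask = ((dist >= radius_inner) & (dist < radius_outer)).astype(float)
    return mask / mask.sum()


def local_mean(energy, mask):
    """Mean of `energy` under the mask at every position (circular convolution)."""
    return np.real(np.fft.ifft2(np.fft.fft2(energy) * np.fft.fft2(mask)))


def contrast_gain_map(band_energy, surround_energy, window_diameter):
    """Energy in the central window minus energy in the surrounding ring."""
    radius = window_diameter / 2.0
    # The ring is as wide as the window diameter, so it spans radius .. 3 * radius.
    center = local_mean(band_energy, ring_mask(band_energy.shape, 0.0, radius))
    ring = local_mean(surround_energy, ring_mask(band_energy.shape, radius, 3.0 * radius))
    return center - ring
```

With this simplification, a uniform energy map gives zero gain everywhere, while an isolated patch of high energy gives a positive peak at the patch location.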

As a result, 5 saliency maps were generated for each initial image (for 5 filtering frequencies). 
Then, on each map, the local maxima of the increase in contrast were ranked in descending order of
amplitude. Local maxima were selected, starting from the highest, according to the following rule: 1
position was selected at a filtering frequency of 4 CPI, 2 positions at 8 CPI, 4 positions at 16 CPI,
8 positions at 32 CPI, and 16 positions at 64 CPI.
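
The selection rule can be sketched in a few lines. The code below is an assumption-laden illustration: it treats a pixel as a local maximum when it strictly exceeds all eight neighbours (so flat plateaus are ignored), and the function and constant names are hypothetical rather than taken from the GOTC program.

```python
import numpy as np

# Number of positions selected at each filtering frequency (CPI), as stated in the text.
POSITIONS_PER_FREQUENCY = {4: 1, 8: 2, 16: 4, 32: 8, 64: 16}


def top_local_maxima(gain_map, n):
    """Return up to n (row, col) local maxima of gain_map, highest first."""
    n_rows, n_cols = gain_map.shape
    padded = np.pad(gain_map, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones(gain_map.shape, dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbour = padded[1 + dr:1 + dr + n_rows, 1 + dc:1 + dc + n_cols]
            is_max &= gain_map > neighbour
    rows, cols = np.nonzero(is_max)
    order = np.argsort(gain_map[rows, cols])[::-1][:n]   # descending amplitude
    return [(int(rows[i]), int(cols[i])) for i in order]
```

For a map computed at 16 CPI, for example, `top_local_maxima(gain_map, POSITIONS_PER_FREQUENCY[16])` would return the four highest peaks.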

After that, we moved on to creating stimuli. First, each initial image was filtered (with a 10th-order
Butterworth filter) in five one-octave-wide frequency bands with center frequencies of 4, 8, 16, 32, and 
64 CPI. Then, a circular aperture with a Gaussian transfer function was placed in the positions previously 
selected on the saliency maps. An already filtered image of the corresponding spatial frequency was 
passed through it. The aperture diameter was equal to the diameter of the central region of the gradient 
operator (at the lowest frequency, the entire image is transmitted; at higher frequencies, progressively 
smaller fragments of the image are transmitted). 
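
The two operations described above, band-pass filtering and windowing with a Gaussian aperture, can be sketched as follows. This is a minimal illustration under stated assumptions: the band edges are placed half an octave on either side of the center frequency, the band-pass is built as the product of a low-pass and a high-pass Butterworth transfer function in the form commonly used in image processing, and the aperture's standard deviation is set to one sixth of its diameter. None of these choices is specified in the text, and the function names are hypothetical.

```python
import numpy as np


def butterworth_bandpass(image, center_cpi, order=10):
    """Filter a 2-D image in a 1-octave band around center_cpi (cycles per image)."""
    n_rows, n_cols = image.shape
    fy = np.fft.fftfreq(n_rows) * n_rows          # vertical frequency, cycles per image
    fx = np.fft.fftfreq(n_cols) * n_cols          # horizontal frequency, cycles per image
    radius = np.hypot(fy[:, None], fx[None, :])
    low_cut = center_cpi / np.sqrt(2.0)           # assumed lower band edge
    high_cut = center_cpi * np.sqrt(2.0)          # assumed upper band edge
    lowpass = 1.0 / (1.0 + (radius / high_cut) ** (2 * order))
    highpass = 1.0 - 1.0 / (1.0 + (radius / low_cut) ** (2 * order))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * lowpass * highpass))


def gaussian_aperture(shape, center, diameter):
    """Gaussian transfer function centred on `center` (row, col)."""
    rows, cols = np.indices(shape)
    dist_sq = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    sigma = diameter / 6.0                        # assumption: diameter spans +/- 3 sigma
    return np.exp(-dist_sq / (2.0 * sigma ** 2))
```

A fragment for one selected position would then be obtained as `butterworth_bandpass(image, f) * gaussian_aperture(image.shape, position, diameter)`.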

Facial stimuli were created by combining images transmitted through the aperture from different 
spatial frequency ranges (15 different combinations of frequency ranges were used). As a result, for each 
initial face image, 15 stimuli were created, consisting of areas of highest increase in non-local contrast. 
For experiment 1, stimuli consisting of areas of the initial image with the smallest increase in
contrast were also created in a similar way.

After performing all calculations, the created stimuli were scaled down to 8.5 degrees of visual angle. As a result,
the lowest filtering frequency, equal to 0.5 cycles per degree (CPD), approximately corresponded to the
frequency tuning of the lowest frequency channel in the human visual system.
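
The conversion from cycles per image to cycles per degree follows directly from the angular size of the stimulus. The helper below is a hypothetical illustration of that arithmetic and assumes that the image spans the full 8.5 degrees.

```python
def cpi_to_cpd(freq_cpi, image_size_deg=8.5):
    """Convert cycles per image to cycles per degree of visual angle."""
    return freq_cpi / image_size_deg


for freq in (4, 8, 16, 32, 64):
    print(f"{freq:>2} CPI = {cpi_to_cpd(freq):.2f} CPD")
```

For the lowest filtering frequency this gives 4 / 8.5, or about 0.47 CPD, which the text rounds to 0.5 CPD.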


Procedure 

Prior to the experiment, observers were given instructions and shown examples of faces
expressing basic emotions. In the experiment, stimuli were presented in a random sequence, and their
duration was not limited. Viewing distance was 70 cm. The observers were tasked with recognizing facial 
expressions, choosing 1 of 7 possible responses that characterize emotional expression. The responses 
were given verbally. The accuracy of recognition for each type of stimuli was calculated as a percentage 
of correct responses. 


Statistical data analysis 

ANOVA was used for statistical analysis of the results. Pairwise post-hoc comparisons of the percentages of
correct responses were carried out within the ANOVA procedure using Student's t-test
with Holm's correction for multiple comparisons.
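
For readers unfamiliar with Holm's correction, the step-down adjustment can be sketched as follows. This is a generic illustration of the standard procedure, not the code of the statistical package used in the study; the function name is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the ordered sequence.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

Because adjusted values are capped at 1, comparisons with large raw p-values can be reported with an adjusted level of exactly 1.000.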


Results 


Experiment 1. Influence of the magnitude of the increase in the contrast of the regions forming the 
stimulus on the recognition of facial expression 

In a previous study, it was shown that the greater the increase in total contrast in
the areas from which the facial stimulus is formed, the more accurately happy (joyful) and neutral faces
are distinguished (Babenko et al., 2021). However, since the present study was intended to use a
significantly larger number of facial expressions (6 basic emotions according to Ekman and a neutral
expression), we considered it necessary to repeat the experiment and compare the
recognition accuracy of 7 expressions. This time we limited ourselves to two sets of stimuli, created from
areas with the largest and the smallest increase in total contrast.


Procedure 

Experiment 1 involved 52 observers. 

Stimuli were created by combining selected fragments of the initial image in 4 spatial frequency 
ranges with peak frequencies of 8, 16, 32, and 64 CPI (Fig. 1). Each of the 7 facial expressions was 
represented by 20 stimuli formed from 5 female and 5 male faces (10 images were created from areas 
with the lowest non-local contrast modulation, and 10 from areas of highest contrast). A total of 140 stimuli 
((10+10)*7) were generated for this experiment. 

Twenty-six subjects were asked to categorize facial expressions when viewing stimuli created from regions
with the lowest non-local contrast gain. The other 26 observers performed the same task with stimuli generated
from regions with the highest increase in non-local contrast. Each subject was presented with 70 stimuli.
“I don’t know” was one of the possible responses.

Data analysis was performed using one-way ANOVA (intersubject, repeated measures). The 
independent variable was the amplitude of contrast modulation of the areas that were used for synthesized 
stimuli. The dependent variable was the proportion of correct responses in the expression recognition 
task. 

Results 

Experiment 1 revealed a statistically significant effect of the contrast of the areas that were used for
creating stimuli on the accuracy of expression recognition (F(1,50) = 699.28, p < 0.001, ω² = 0.931). The
performance was significantly higher for stimuli created from areas of the initial image with the highest 
increase in non-local contrast (max) compared to stimuli created from areas with the lowest increase in 
contrast (min) (Fig. 2). 
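
The reported effect size can be related to the F statistic with the usual omega-squared estimate for a one-way design with independent groups. The sketch below assumes that this formula applies to the reported test (two groups of 26 observers, 52 in total); the text does not state which formula was used, and the function name is hypothetical.

```python
def omega_squared_between(f_value, df_effect, n_total):
    """Omega squared from F for a one-way design with independent groups."""
    return df_effect * (f_value - 1.0) / (df_effect * (f_value - 1.0) + n_total)


effect_size = omega_squared_between(f_value=699.28, df_effect=1, n_total=52)
print(f"omega squared = {effect_size:.3f}")
```

Under this assumption the estimate rounds to 0.931, in agreement with the value reported above.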


www.ijcrsee.com 


Babenko et al. (2022). Recognition of facial expressions based on information from the areas of highest increase in luminance 
contrast, International Journal of Cognitive Research in Science, Engineering and Education (IJCRSEE), 10(3), 37-51. 


Figure 1. Examples of stimuli used in experiment 1. An example of a stimulus created from areas
of the initial face image with the highest contrast gain (above). An example of a stimulus created from areas
with the lowest gain in non-local contrast (bottom). The regions used to create stimuli were selected in
the range of spatial frequencies from 5.6 to 90.2 CPI.

Figure 2. Accuracy of expression recognition depending on the contrast gain in the areas that were used for
creating stimuli. “Min” is for stimuli created from areas of the initial image with the lowest non-local
contrast increase, “Max” is for stimuli created from areas of highest contrast increase. The y-axis shows
the proportion of correct responses.


highest contrast increase in the range of 4 octaves is useful for recognizing expressions and provides 
a relatively high accuracy of recognition. In stimuli created from regions with the lowest contrast gain, 
emotions are correctly determined only at a random decision level. 


Experiment 2. Accuracy of expression recognition in facial stimuli created using the areas of
highest increase in contrast with different combinations of spatial-frequency ranges

After it was established that the information contained in the areas of the facial image with the 
highest contrast gain is useful for expression recognition, it was necessary to understand in which 
frequency range this information provides the best result for recognizing a particular facial expression. 

The majority of researchers agree that medium spatial frequencies are most important for face
recognition. However, there is a variety of data on different “effective” ranges, expressed in cycles per
face (CPF): 8-16 CPF (Costen, Parker and Craw, 1996; Gold, Bennett and Sekuler, 1999), 8-13 CPF
(Nasanen, 1999), 11-16 CPF (Tanskanen et al., 2005). Collin et al. (2006) extended this range to 25 CPF.
At the same time, the role of the general configuration in face recognition has been emphasized in many
studies (e.g., Cheung et al., 2008; Leder and Bruce, 2000; Maurer et al., 2002; McKone, 2008). A holistic
perception of the face implies its low-frequency description, lower than 8 CPF (Awasthi et al., 2011;
Goffaux and Rossion, 2006).

As for facial expression recognition, many authors also prefer configuration information, and hence 
low spatial frequencies, when solving this problem (e.g., Bombari et al., 2013; Calder et al., 2000; Calvo 
and Beltran, 2014; Tanaka et al., 2012; White, 2000). Others, on the contrary, emphasize the role of internal 
features of the face and, as a result, higher spatial frequencies (Blais et al., 2012; Royer et al., 2018;
Smith and Schyns, 2009). The fMRI data also contradict the notion that low-frequency information plays
a critical role in the processing of facial expressions (Morawetz et al., 2011). Moreover, C. Deruelle and
J. Fagot provide evidence in favor of the priority of high-frequency information in the task of expression
categorization (Deruelle and Fagot, 2005). This contradiction in experimental findings could be caused
by the fact that different emotional expressions are encoded by different spatial frequencies (Kumar
and Srinivasan, 2011; Pourtois et al., 2005; Stein et al., 2014; Vlamings, Goffaux and Kemner, 2009;
Vuilleumier et al., 2003).

Thus, the objective of the second experiment was to determine the frequency ranges for the best
recognition accuracy for each of the basic emotions, as well as neutral facial expressions, created from
areas of highest increase in non-local contrast.


Procedure 

Experiment 2 involved 78 subjects. 

The stimuli were created using the areas of the initial images with the highest increase in the total 
non-local contrast. Fragments of the face were isolated in five ranges of spatial frequencies with peak 
frequencies of 4, 8, 16, 32, and 64 CPI. All possible combinations of adjacent frequency ranges were 
used. A total of 1050 facial stimuli were created (10 initial faces (5 male + 5 female) * 7 facial expressions 
* 15 combinations of spatial frequencies). 
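
The count of 15 combinations follows from taking every run of adjacent frequency ranges out of the five available. A minimal sketch of this enumeration is given below; the function name is hypothetical.

```python
CENTER_FREQUENCIES_CPI = [4, 8, 16, 32, 64]


def adjacent_combinations(bands):
    """All contiguous runs of one or more neighbouring frequency ranges."""
    return [tuple(bands[start:stop])
            for start in range(len(bands))
            for stop in range(start + 1, len(bands) + 1)]


combinations = adjacent_combinations(CENTER_FREQUENCIES_CPI)
n_faces, n_expressions = 10, 7
print(len(combinations), n_faces * n_expressions * len(combinations))
```

Five single ranges, four pairs, three triples, two quadruples and the full set give 15 combinations, and 10 faces x 7 expressions x 15 combinations gives the 1050 stimuli mentioned above.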

The stimuli were presented in a random sequence. Observers chose one of 7 possible responses 
after each stimulus was presented. 


Results 

In experiment 2, we calculated the accuracy of recognition of all basic emotions and neutral facial 
expressions in stimuli created from areas of highest increase in total nonlocal contrast with different 
combinations of spatial frequency bands in the stimulus (Table 1). 


Table 1
Expression recognition accuracy with different frequency contents of facial stimuli created from
areas of highest increase in nonlocal contrast

Stimulus            Facial expressions (expression recognition accuracy in percent)
frequency content*      fear    anger  sadness  disgust  neutral  surprise  happiness  Mean (%)
4                        4.2      6.3     40.4      1.9     33.2       7.1       31.7      17.8
4+8                      9.9     35.5     44.9     21.9     44.0      49.9       66.4      38.9
4+8+16                  60.5     58.2     51.7     69.7     69.4      69.9       78.1      65.4
4+8+16+32               46.4     55.6     80.3     73.7     84.2      79.7       90.1      72.9
4+8+16+32+64            35.1     50.1     70.0     72.8     85.9      81.8       97.4      70.5
8                       16.9     27.8     23.9     17.7     24.1      41.9       27.1      25.6
8+16                    41.9     43.7     60.8     69.5     66.9      71.5       72.1      60.9
8+16+32                 40.3     47.2     58.6     70.9     81.3      81.7       95.6      67.9
8+16+32+64              61.8     61.0     60.5     76.3     83.1      81.9       95.6      74.3
16                      36.8     30.0     36.9     64.0     66.2      60.4       67.1      51.6
16+32                   26.8     36.0     48.1     71.4     80.1      75.4       96.5      62.1
16+32+64                37.7     44.9     56.5     74.9     82.6      77.2       93.2      66.7
32                      41.9     23.5     31.0     42.6     55.8      56.3       79.7      47.3
32+64                   26.7     12.7     35.8     47.6     68.5      47.4       92.2      47.3
64                      16.5      9.7     31.4     26.2     50.1      20.8       67.3      31.7

* Here and in the following tables, the combination of spatial frequency ranges in the stimuli is shown (the central
frequency of each range is in cycles per image).


We began the analysis of the obtained results with an assessment of the accuracy of expression
recognition based on a low-frequency holistic description of the face. To do this, we analyzed the
percentage of correct responses for those trials in which the image of the entire face filtered in the range of
2.8-5.6 CPI (central frequency 4 CPI) was presented as a stimulus. These stimuli were created by filtering the
initial images at the specified frequency and passing them through an aperture with a Gaussian transfer
function, the diameter of which corresponded to the largest extent of the analyzed image (facial image height).
Table 1 shows that in this case the accuracy of expression recognition was 17.8% (the random decision level
was 14.2%, and the 95% confidence interval ranged from 10.76% to 27.86%). At the
same time, our previous findings indicate that if such facial stimuli are presented among other objects
created in a similar way, the accuracy of face detection reaches 75%. This suggests that low-frequency
information may be sufficient to detect a face, but not to differentiate the emotions expressed on it.
This confirms the idea that low-frequency information alone is not enough for facial expression recognition
(e.g., Jennings, Yu and Kingdom, 2017).

Taking into account the data confirming the global precedence effect (Goffaux et al., 2011; Peyrin et
al., 2010), we studied how the accuracy of recognition changes with a gradual expansion of the bandwidth,
starting from the lowest frequency range (2.8-5.6 CPI), by adding progressively higher frequency ranges
(function 1 in Figure 3). As expected, expanding the range of spatial frequencies improves the results. The
most noticeable performance increase was observed when expanding the range from 1 to 3 octaves. The
addition of the 5th octave no longer affected the accuracy for this task.




Figure 3. Accuracy of expression recognition with expanding the range of spatial frequencies that
were used for the facial stimuli. For function 1, the expansion of the frequency range starts from a frequency of
4 CPI, for function 2 - from 8 CPI, for function 3 - from 16 CPI. On the x-axis is the width of the frequency 
band of the stimulus in octaves. The y-axis shows the percent of correct responses. 


Functions 1 and 2 in Figure 3 overlap when the bandwidth becomes equal to 3 octaves. However, 
the initial increase for function 2 was more significant. The difference is especially noticeable at a 
bandwidth of 2 octaves. If the spatial frequency increment starts from a higher frequency range (11.3-22.6 
CPI, the center frequency is 16 CPI), a significant difference between this curve and the previous ones 
arises already for a frequency band of 1 octave (function 3 in Figure 3). 
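
The three functions discussed here can be read off the Mean column of Table 1. The sketch below rebuilds them from those published means; the dictionary layout and the function name are illustrative choices, not part of the original analysis.

```python
# Mean accuracy (%) from Table 1, keyed by the central frequencies (CPI) in the stimulus.
MEAN_ACCURACY = {
    (4,): 17.8, (4, 8): 38.9, (4, 8, 16): 65.4,
    (4, 8, 16, 32): 72.9, (4, 8, 16, 32, 64): 70.5,
    (8,): 25.6, (8, 16): 60.9, (8, 16, 32): 67.9, (8, 16, 32, 64): 74.3,
    (16,): 51.6, (16, 32): 62.1, (16, 32, 64): 66.7,
}


def expansion_function(start_cpi):
    """Mean accuracy by bandwidth (octaves) when expanding upward from start_cpi."""
    combos = sorted((c for c in MEAN_ACCURACY if c[0] == start_cpi), key=len)
    return {len(combo): MEAN_ACCURACY[combo] for combo in combos}


for label, start in (("function 1", 4), ("function 2", 8), ("function 3", 16)):
    print(label, expansion_function(start))
```

At a bandwidth of 3 octaves the three functions lie within a few percentage points of each other, whereas at 1 octave the function starting from 16 CPI is far above the other two.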

Thus, any range of spatial frequencies three octaves wide is sufficient for relatively
efficient (about 70% correct responses) differentiation of expressions in facial stimuli created from areas of
highest contrast gain. The obtained functions were compared using two-way repeated
measures ANOVA with Greenhouse-Geisser correction (main effects: Band Width (1, 2 and 3 octaves) and
Start Frequency (4, 8 and 16 CPI), as well as their interaction). It revealed that the significant increase in
performance with the expansion of the frequency band of the stimuli towards higher spatial frequencies
(F(1.699, 130.852) = 1804.298, p < 0.0001, ω² = 0.824) depends on the frequency from which the band
expansion begins (F(1.661, 127.934) = 519.873, p < 0.0001, ω² = 0.584). Significantly more information
about facial expression is contained in the 1-octave range with a central frequency of 16 CPI
than in other frequency ranges of the same width (Table 2). In addition, the increase in
performance occurs faster when the range is expanded starting from this frequency (F(3.479, 130.852) =
246.979, p < 0.0001, ω² = 0.472).


Table 2
Comparison of expression recognition accuracy for stimuli with a bandwidth of 1 octave

Frequency content of        Expression recognition   Student's t-test   Significance level adjusted
compared stimuli (central   accuracy (%)             (t)                for multiple comparisons
frequency in CPI)                                                       (pHolm)
4 / 8                       17.8 / 25.6               7.453             0.000
4 / 16                      17.8 / 51.6              32.316             0.000
8 / 16                      25.6 / 51.6              24.863             0.000

However, if we track how the accuracy of expression recognition changes with the expansion of the 
frequency range not only towards an increase, but also towards a decrease in the spatial frequency, then 
we will get a somewhat unexpected result. For different emotions, the optimal direction of the frequency 
range expansion is evidently different (Table 3). 

Table 3
Comparison of recognition accuracy of different expressions for stimuli with a bandwidth of 2
octaves

Facial          Accuracy (%)      Student's t-test   Significance level adjusted
expressions     8+16 / 16+32      (t)                for multiple comparisons (pHolm)
fear            41.9 / 26.8        6.615             0.000
anger           43.7 / 36.0        3.364             0.068
sadness         60.8 / 48.1        5.550             0.000
disgust         69.5 / 71.4        0.841             1.000
surprise        71.5 / 75.4        1.628             1.000
neutral         66.9 / 80.1        5.774             0.000
happiness       72.1 / 96.5       10.708             0.000

Higher accuracy values in comparison pairs are shown in bold. 


The table shows that for happiness and a neutral facial expression, it is indeed better
to add information from higher spatial frequencies to the range of 11.3-22.6 CPI. For the recognition of
emotions of negative valence (fear, anger, sadness), it turned out to be more effective to expand the
frequency range towards lower spatial frequencies. Moreover, this is less typical for anger than for other
negative emotions. At the same time, for disgust and surprise, the expansion in both directions turned out
to be almost equivalent.

Considering that the range with the central frequency of 16 CPI turned out to be the most informative 
(see Figure 3), we can assume that information from this range is processed first. This information may 
be sufficient to hypothesize a probable facial expression, and the results of this preliminary analysis 
determine the direction of further expansion of the frequency range. 

This assumption does not contradict the thesis about the sequential processing of spatial 
frequencies from lower to higher ones, but at the same time, it is consistent with the data on the possibility 
of flexible use of early perceptual representation by top-down control. This allows the visual system to 
selectively use different spatial frequencies depending on how useful they are for solving a particular 
problem (Flevaris and Robertson, 2016; Oliva and Schyns, 1997). 

We then moved on to the main question in experiment 2: what combination of frequency ranges 
is most effective for recognizing each of the expressions? The result of this analysis is shown in Table 4. 


Table 4
Combinations of spatial-frequency ranges in facial stimuli formed from areas of highest contrast
gain, providing the best result of expression recognition

Frequency content   Facial expressions (expression recognition accuracy in percent)
of stimuli          fear    anger  sadness  disgust  neutral  surprise  happiness
4+8+16+32           46.4     55.6     80.3     73.7     84.2      79.7       90.1
4+8+16+32+64        35.1     50.1     70.0     72.8     85.9      81.8       97.4
8+16+32+64          61.8     61.0     60.5     76.3     83.1      81.9       95.6

Higher accuracy values in comparison pairs are shown in bold. 


Table 4 shows that the optimal combinations of spatial frequencies in the stimulus differ between
facial expressions. Thus, for better recognition of a neutral facial expression and happiness, the full
frequency range, that is, all 5 octaves, is preferable. To recognize other emotions, a band of 4
octaves is enough. However, for stimuli expressing sadness, the effective range is shifted to lower
spatial frequencies, while for other emotions it is shifted to a higher frequency region. It should also be noted
that for the negative emotions (fear, anger, sadness) the optimum is quite clear (significant differences
were obtained according to Student's test), while for other expressions it is not so obvious.
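
The choice of the best combination for each expression can be reproduced from the accuracies in Table 4. The sketch below simply picks the highest accuracy per expression and ignores the significance tests mentioned above; the variable and function names are hypothetical.

```python
EXPRESSIONS = ["fear", "anger", "sadness", "disgust", "neutral", "surprise", "happiness"]

# Recognition accuracy (%) from Table 4, one list per combination of frequency ranges.
TABLE_4 = {
    "4+8+16+32":    [46.4, 55.6, 80.3, 73.7, 84.2, 79.7, 90.1],
    "4+8+16+32+64": [35.1, 50.1, 70.0, 72.8, 85.9, 81.8, 97.4],
    "8+16+32+64":   [61.8, 61.0, 60.5, 76.3, 83.1, 81.9, 95.6],
}


def best_combination():
    """Map each expression to the combination with the highest accuracy."""
    best = {}
    for col, expression in enumerate(EXPRESSIONS):
        best[expression] = max(TABLE_4, key=lambda combo: TABLE_4[combo][col])
    return best


for expression, combo in best_combination().items():
    print(f"{expression:>9}: {combo}")
```

For surprise the margin between the two best combinations is only 0.1 percentage points, which illustrates why the optimum is described as less obvious for the non-negative expressions.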

Finding the optimal combination of spatial-frequency ranges for each facial expression allowed us 
to move on to experiment 3. 

Experiment 3. Testing the possibility of effective expression recognition in facial stimuli created
with the areas of highest contrast gain.

The results obtained indicate that the information from the areas of the face with the highest contrast
gain is indeed useful for expression recognition. However, the question remains how much performance
depends on whether the subject uses all the information about the face, or only information
from areas of highest increase in non-local contrast. To answer it, we compared, under the same
experimental conditions, the accuracy of expression recognition in photographic images of real faces
(unfiltered) and in faces formed from fragments selected in the optimal spatial frequency ranges for each
emotion.


Procedure 

Experiment 3 involved 49 subjects. 

Synthesized facial stimuli expressing fear, anger, disgust, and surprise included frequencies of 8, 
16, 32, and 64 CPI. Stimuli expressing sadness were created from the ranges with central frequencies 
of 4, 8, 16, and 32 CPI. Stimuli with a neutral expression and happiness were created from fragments 
identified in the range of five octaves: 4, 8, 16, 32 and 64 CPI. The set of real face images used as 
stimuli did not overlap with the set of initial images used to create the synthesized stimuli. A total of 70 
synthesized and unfiltered facial images were used (10 faces x 7 expressions). 

The stimuli were presented in a random sequence. The exposure time was not limited. After
training, the subjects were asked to make a decision on each presented stimulus as quickly as possible
and press a key. Pressing the key removed the image, which allowed us to measure the decision
time. The subjects then gave a verbal response, which was recorded by the experimenter. As before, the
range of possible responses was limited to 7 expressions.


Results 

The results obtained in experiment 3 are shown in Figure 4. In general, the average accuracy of
expression recognition was, as expected, somewhat higher for natural facial images (83%
correct responses) than for synthesized stimuli (73%). For real images, the decision time was also
shorter (by 290 ms on average).

Figure 4. Accuracy of expression recognition in real (continuous line) and synthesized (dotted line)
faces.


For statistical analysis of the obtained data we used a two-way repeated measures ANOVA
(main effects: Expression (7 expressions) and Stimulus Type (real and synthesized), as well as their
interaction). It confirmed that recognition accuracy differs between expressions for both
real and synthesized facial stimuli (F(3.284, 157.609) = 68.276, p < 0.0001, ω² = 0.530, Greenhouse-Geisser
corrected). Recognition accuracy also differs significantly between the two types of stimulus
(F(1, 48) = 110.154, p < 0.0001, ω² = 0.351). The curves in Fig. 4 differ in shape as well
(Expression x Stimulus Type interaction: F(4.755, 228.233) = 8.911, p < 0.0001, ω² = 0.101,
Greenhouse-Geisser corrected). This interaction is explained by the fact that for disgust, surprise,
happiness and the neutral expression the recognition accuracy is higher for real face images, while fear,
anger and sadness are recognized in real faces with practically the same accuracy as in
synthesized images (Table 5).
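
The fractional degrees of freedom reported above come from the Greenhouse-Geisser correction, which multiplies both the effect and the error degrees of freedom by the same factor epsilon. As a consistency check, the sketch below recovers epsilon from the reported values. It assumes that all 49 subjects entered the analysis, which the error term F(1, 48) supports, so the uncorrected degrees of freedom for Expression and for the interaction are 6 and 6 x 48 = 288. The function name `gg_epsilon` is hypothetical.

```python
def gg_epsilon(df_effect_corrected, df_error_corrected, n_levels, n_subjects):
    """Recover the Greenhouse-Geisser epsilon implied by corrected degrees of freedom.

    Returns two estimates, one from the effect df and one from the error df;
    they should agree up to rounding of the reported values.
    """
    df_effect = n_levels - 1                   # uncorrected effect df
    df_error = df_effect * (n_subjects - 1)    # uncorrected error df
    return df_effect_corrected / df_effect, df_error_corrected / df_error


if __name__ == "__main__":
    # Main effect of Expression (7 levels, 49 subjects)
    print(gg_epsilon(3.284, 157.609, n_levels=7, n_subjects=49))
    # Expression x Stimulus Type interaction: (7 - 1) * (2 - 1) = 6 effect df
    print(gg_epsilon(4.755, 228.233, n_levels=7, n_subjects=49))
```

Both estimates agree within rounding (about 0.55 for Expression and about 0.79 for the interaction) and lie above the theoretical lower bound of 1/6 for a factor with 7 levels.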


Table 5
Comparison of recognition accuracy of different expressions for real and synthesized facial stimuli

Facial         Stimuli, real / synthesized
expression     (% correct)                     t         p_Holm
fear           47 / 51                        -1.424     1.000
anger          71 / 67                         1.582     1.000
sadness        78 / 71                         2.610     0.246
disgust        91 / 74                         6.487     0.000
surprise       95 / 76                         7.357     0.000
neutral        96 / 84                         4.430     0.000
happiness      99 / 88                         4.746     0.000


Higher accuracy values in comparison pairs are shown in bold. 


The accuracy of expression recognition in real and synthesized facial stimuli differs somewhat. At
the same time, real and synthesized faces formed the same sequence of gradual increase in recognition
accuracy across the series of expressions (see Fig. 4). Statistical analysis using a rank correlation
coefficient showed that these are similar functions (Kendall's τb(47) = 1, p < 0.001). This may indicate
that the natural course of information processing is not disturbed when a real face is replaced with a
synthesized image created from fragments with the highest contrast gain. However, the synthesized stimuli
contain enough information to recognize emotions of negative valence, but not enough to recognize the other
expressions. This suggests that some important information is missing from the synthesized facial stimuli.
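
To illustrate what a rank correlation of 1 means here, the following sketch computes Kendall's tau-b on the seven pairs of mean accuracies from Table 5. This is only an illustration on the group means; the coefficient reported above may have been computed on a different level of the data. The function name `kendall_tau_b` is hypothetical.

```python
from itertools import combinations
from math import sqrt

# Mean accuracy (%) from Table 5, in the order
# fear, anger, sadness, disgust, surprise, neutral, happiness.
REAL = [47, 71, 78, 91, 95, 96, 99]
SYNTH = [51, 67, 71, 74, 76, 84, 88]


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (undefined if either is constant)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied on both variables: not counted
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1           # the pair is ordered the same way in x and y
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x_only)
                 * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom


if __name__ == "__main__":
    print(kendall_tau_b(REAL, SYNTH))
```

All 21 pairs of expressions are ordered the same way in both series, so the coefficient on these means equals 1: the ranking from fear to happiness is identical for real and synthesized faces even though the absolute accuracies differ.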


Discussions 


The ability of the human visual system to process a huge amount of information in a very short time is
determined by the ability to find "useful" areas in the input image. This step can be based on the search for
spatial heterogeneities in the image using the second-order visual mechanisms. To simulate the operation
of these mechanisms and to test the usefulness of the information extracted by them for expression
recognition, we created the gradient operator of total non-local contrast (GOTC). Two variables determine
the overall contrast: the contrast of the single luminance gradients and the number of gradients in a given
area of the image. Moreover, the second variable makes a greater contribution to the total signal energy.
Therefore, the regions of interest are primarily the areas with the largest accumulation of luminance gradients.

The design of the created operator reflects the main properties of second-order visual filters: the
multichannel nature of the second-order mechanism (a set of operators of different sizes); bandpass
filtering of the carrier and a fixed relationship between the carrier and envelope frequencies (the operator
size is inversely related to the filtering frequency in CPI); opponent organization of the filter, which makes
it possible to encode the amplitude of the contrast modulation (concentric organization of the GOTC);
and a weighting function of the filter receptive fields (Gaussian transfer function aperture). The stimuli
we used were created using this gradient operator.
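
The opponent (center-surround) principle described above can be illustrated with a toy sketch. It is not the GOTC itself: it works at a single scale, omits bandpass filtering and Gaussian weighting, and uses square windows. All function names and parameters are hypothetical. Total contrast in a window is taken as the summed absolute luminance differences between neighboring pixels, so it grows both with the amplitude of single gradients and with their number.

```python
def gradient_energy(img):
    """Per-pixel sum of absolute luminance differences to the right and lower neighbors."""
    h, w = len(img), len(img[0])
    energy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                energy[y][x] += abs(img[y][x + 1] - img[y][x])
            if y + 1 < h:
                energy[y][x] += abs(img[y + 1][x] - img[y][x])
    return energy


def window_stats(energy, cy, cx, radius):
    """Sum and pixel count of a square window, clipped to the image borders."""
    h, w = len(energy), len(energy[0])
    vals = [energy[y][x]
            for y in range(max(0, cy - radius), min(h, cy + radius + 1))
            for x in range(max(0, cx - radius), min(w, cx + radius + 1))]
    return sum(vals), len(vals)


def contrast_gain(img, cy, cx, r_center, r_surround):
    """Mean gradient energy in the center minus the mean in the surrounding ring.

    Requires r_surround > r_center so that the ring is not empty.
    """
    energy = gradient_energy(img)
    c_sum, c_n = window_stats(energy, cy, cx, r_center)
    s_sum, s_n = window_stats(energy, cy, cx, r_surround)
    ring_mean = (s_sum - c_sum) / (s_n - c_n)
    return c_sum / c_n - ring_mean


def demo_image(size=12, lo=4, hi=8):
    """Uniform gray image with a checkerboard patch in rows and columns lo..hi-1."""
    return [[(1.0 if (x + y) % 2 == 0 else 0.0)
             if lo <= x < hi and lo <= y < hi else 0.5
             for x in range(size)]
            for y in range(size)]


if __name__ == "__main__":
    img = demo_image()
    print("on the textured patch:", contrast_gain(img, 5, 5, 2, 5))
    print("on the uniform corner:", contrast_gain(img, 1, 1, 2, 5))
```

In this toy image the gain is positive when the center covers the textured patch and non-positive when the center lies on a uniform area, which is the sense in which an opponent operator singles out areas that exceed their surroundings in total contrast.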

In experiment 1, we showed that the recognition of 7 facial expressions (6 basic emotions and a neutral
expression) reaches a relatively high level of accuracy when it is based on the information of different
spatial frequencies from areas of highest increase in non-local contrast (about 75% of correct responses).
At the same time, facial stimuli created from areas with the lowest contrast gain turned out to be completely
ineffective for solving this problem (recognition accuracy was at chance level). Together with the previously
published results (Babenko et al., 2021), this indicates that the informativeness of an image area is
determined by the degree of its difference in total contrast from the surroundings.

We then analyzed the possibility of using a low-frequency representation of the entire face in
expression differentiation tasks. In previous studies, we have shown that stimuli generated by the operator
with a central area that matched the full-size image were recognized as faces among other stimuli
with high accuracy (about 75%). When in experiment 2 the task was changed, so that it was required
not only to detect a face but also to differentiate the emotions in facial expressions, the result decreased
significantly, to about 18% of correct responses (with a chance level of 14%). This result is
consistent with the widely accepted assumption that face processing should be considered as consecutive
steps of face detection and individualization (Comfort and Zana, 2015). However, at the second stage
of processing the low-frequency description is no longer enough. Higher spatial frequencies provide
additional information about the internal features of the face, which are very important for its configural
description (Goffaux, 2009; Piepers and Robbins, 2012).

In experiment 2, we studied the accuracy of expression recognition in facial stimuli with different
combinations of fragments isolated in different ranges of spatial frequencies. We found that the most
effective frequency range is the 11.3-22.6 CPI band with a center frequency of 16 CPI. Although this
result is not consistent with the idea that low spatial frequencies are the most important in the perception
of faces, it agrees with data indicating that the frequencies of the middle range are most important for
identifying faces. It is noteworthy that in this frequency range (11.3-22.6 CPI), the GOTC more often
singled out the eyes and mouth as areas of interest in the initial images, which are known to be very
important for conveying emotionally significant information.
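
The band limits quoted above follow from the center frequency if the band is one octave wide and centered geometrically: the edges lie a factor of the square root of 2 below and above the center. The sketch below applies this to all five center frequencies used in the study; treating the other four bands as one-octave bands as well is an assumption based on the description of the stimuli as covering five octaves. The function name `octave_band` is hypothetical.

```python
import math


def octave_band(center_cpi):
    """Lower and upper edges of a one-octave band geometrically centered on center_cpi."""
    return center_cpi / math.sqrt(2), center_cpi * math.sqrt(2)


if __name__ == "__main__":
    for center in (4, 8, 16, 32, 64):
        low, high = octave_band(center)
        print(f"{center:>2} CPI: {low:.1f}-{high:.1f} CPI")
```

For a center frequency of 16 CPI this gives 11.3-22.6 CPI, matching the band reported in the text, and adjacent bands defined this way share their edges without gaps or overlap.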

However, unlike the experiments with the Bubbles technique, we did not aim to determine the
independent contribution of each frequency range to expression recognition, since the perception of a
face is not a simple sum of its components (Jack et al., 2012, but see Gold, Mundy and Tjan, 2012). It
was important for us to determine the range of spatial frequencies for each expression that provides the
best accuracy of recognition.

Our results certainly do not provide an unambiguous answer to the question of how information from
different spatial frequency pathways is combined. Previously published results in this area have also been
somewhat controversial. There is evidence that the visual system processes spatial frequencies in a certain
sequence, from low to high (Gao and Bentin, 2011). At the same time, flexible top-down selection of spatial
frequency channels can significantly optimize visual processing (Flevaris and Robertson, 2016). It is
also impossible to exclude the possibility of simultaneous processing of all frequencies. Considering the
above, our results indicate the frequency range that contains the most useful information about
facial expressions and with which it would be most reasonable to start processing (11.3-22.6 CPI). The
conclusion that this information can determine the strategy for further integration of spatial frequencies is
also supported by the fact that for emotions of negative valence it is better to add information from
lower frequency ranges, and for other facial expressions from higher frequency ranges.

Different frequency ranges turned out to be effective for different expressions. For the best
recognition of neutral and joyful facial expressions, all 5 octaves were required. This result is consistent
with data suggesting that a neutral facial expression contains a complete set of basic expressions (Lee and
Kim, 2008), and that the expression of happiness is encoded by both low and high spatial frequencies (Becker
et al., 2012). Our data showed that for sadness recognition 4 octaves were enough (without the highest
frequency range). To recognize fear, anger, disgust and surprise, 4 octaves were also enough, but without
the lowest-frequency range.

Thus, in experiment 2 we determined in which ranges of spatial frequencies the
areas of the greatest contrast gain should be extracted in order to provide the best recognition accuracy
for a particular expression. It was then necessary to make sure that this is exactly the information that is
used by the visual system when recognizing the expression of real faces. To do this, in experiment 3 we
compared the accuracy of recognition of each expression in images of real faces
and in stimuli formed from the optimal combination of selected fragments. Synthesized images were
recognized somewhat worse than real ones (73% versus 83%).

It is interesting to note that the decrease in recognition accuracy for the synthesized stimuli was
not found for the expressions of negative valence. In these cases, the fragmentary images of faces
were perceived with approximately the same accuracy as real ones. This peculiarity of the recognition
of negative expressions is consistent with evidence that the perception of such emotions is associated
with the activation of special mechanisms (Shaw et al., 2011; Stein et al., 2014; Vuilleumier et al., 2003).
However, this does not dismiss the question of the insufficiency of the information contained in the selected
areas for the recognition of other emotions. It became obvious that some of the useful information is
missing from the synthesized stimuli. The increase in decision time probably points to the same conclusion.
In fact, this was expected.

Although we tried to rely on literature data when choosing the operator parameters, in a number of
cases the choice had to be made arbitrarily. This concerns, for example, the number of areas selected in
each of the frequency ranges. An increase in their number, especially at high spatial frequencies,
can be expected to improve the recognition rate. Another aspect that can affect the accuracy of expression
recognition is the filtering frequency in cycles per aperture. Previous research suggests that the optimal
carrier-envelope ratio in second-order filters is 1 to 8 (Babenko, Ermakov and Bozhinskaya, 2010; Peng
and Schofield, 2011). However, this result was obtained in tasks with modulated textures and not
faces. It is possible that even a slight increase in the filtering frequency (for example, from 4 to 4.5
cycles per aperture) can improve the accuracy of expression recognition.

The most interesting finding that we would like to emphasize concerns the ordering of expressions by
recognition accuracy. Numerous studies have shown that people recognize different expressions with different
efficiency, and that the recognition accuracies for different expressions form a certain sequence: fear is
recognized with the worst accuracy, and happiness with the best. In experiment 3, as in previous studies, we
found such a sequence of increasing recognition accuracy for images of real faces, and it was repeated with
synthesized images created from the areas of the greatest contrast gain. This may be evidence that the
replacement of a real image by a fragmented one, although accompanied by some general decrease in recognition
accuracy, does not violate the general logic of the processing.


Conclusions 


The obtained results indicate that the informative content of image areas can be determined by the
difference between these areas and their surroundings in terms of such a physical parameter as the total
non-local contrast. Moreover, the greater this difference, the higher the informational significance of these
fragments. This seemingly unexpected result can be explained by the fact that the greatest contribution to
the value of the total contrast is made not so much by the contrast of each single luminance gradient as
by the total number of gradients in the analyzed image area. Since each gradient is a kind of visual
information unit, the more gradients an area contains, the more informative it is.

We established that information from the areas of highest increase in contrast is necessary for
facial expression recognition. Moreover, this information is sufficient for recognition of basic expressions
with high accuracy (73% of correct responses for the synthesized stimuli).

These areas are characterized by spatial modulation of luminance gradients and they can be
extracted from the input image by second-order visual filters. Thus, these filters are good candidates to
be viewed as a mechanism for selecting the areas of interest.

Since the signal at the filter output is proportional to the amplitude of the modulation, those filters
that are more activated than their neighbors gain an advantage due to the lateral interaction between the
filters. The locations of these filters form a saliency map, in which priorities for selective attention are
distributed in accordance with the amplitude of the modulation.

At the same time, the filters themselves, drawing attention to certain areas of the image, can actually 
play the role of windows through which information from these areas of the visual field is transmitted to 
post-attentive levels of processing. 

Thus, the results obtained allow us to draw the following conclusions: 

- Information from image areas of highest increase in luminance contrast is necessary and sufficient 
for recognition of basic facial expressions. 

- The second-order visual filters extract the salient regions of the image, and a signal value at the 
filter output determines its priority for attention. 

- The receptive fields of the second-order filters act as windows for the attention to extract 
information, which is then transferred to post-attentive levels of processing. 